Jupyter Notebook Best Practices
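Version control
Commit the exported .py and .html files alongside each notebook to make code review easier.
This is done with Jupyter Notebook's FileContentsManager.post_save_hook. Add the following code to the Jupyter Notebook configuration file (typically jupyter_notebook_config.py):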
import os
from subprocess import check_call
def post_save(model, os_path, contents_manager):
    """post-save hook for converting notebooks to .py scripts and .html pages"""
    if model['type'] != 'notebook':
        return # only do this for notebooks
    d, fname = os.path.split(os_path)
    check_call(['jupyter', 'nbconvert', '--to', 'script', fname], cwd=d)
    check_call(['jupyter', 'nbconvert', '--to', 'html', fname], cwd=d)
c.FileContentsManager.post_save_hook = post_save
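Before committing, it can help to confirm that every notebook really has both exported files next to it. The following is a minimal sketch of such a check, not part of the hook above; the function name missing_exports is hypothetical, and it assumes Python notebooks, for which nbconvert writes a .py script.

```python
from pathlib import Path


def missing_exports(directory, suffixes=(".py", ".html")):
    """Map each notebook file name to the exported suffixes it lacks."""
    missing = {}
    for notebook in sorted(Path(directory).glob("*.ipynb")):
        # with_suffix swaps ".ipynb" for the export suffix, e.g. a.ipynb -> a.html
        absent = [s for s in suffixes if not notebook.with_suffix(s).exists()]
        if absent:
            missing[notebook.name] = absent
    return missing
```

Calling missing_exports("develop") returns an empty dict when every notebook in that directory has its .py and .html siblings.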
Directory structure / file naming
- develop # (Lab-notebook style)
 + [ISO 8601 date]-[DS-initials]-[2-4 word description].ipynb
 + 2015-06-28-jw-initial-data-clean.html
 + 2015-06-28-jw-initial-data-clean.ipynb
 + 2015-06-28-jw-initial-data-clean.py
 + 2015-07-02-jw-coal-productivity-factors.html
 + 2015-07-02-jw-coal-productivity-factors.ipynb
 + 2015-07-02-jw-coal-productivity-factors.py
- deliver # (final analysis, code, presentations, etc)
 + Coal-mine-productivity.ipynb
 + Coal-mine-productivity.html
 + Coal-mine-productivity.py
- figures
 + 2015-07-16-jw-production-vs-hours-worked.png
- src # (modules and scripts)
 + __init__.py
 + load_coal_data.py
 + figures # (figures and plots)
 + production-vs-number-employees.png
 + production-vs-hours-worked.png
- data # (backed up separately from version control)
 + coal_prod_cleaned.csv
There are many benefits to this workflow and structure. The first and primary one is that it creates a historical record of how the analysis progressed. It's also easily searchable:
- by date (ls 2015-06*.ipynb)
- by author (ls 2015*-jw-*.ipynb)
- by topic (ls *-coal-*.ipynb)
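The naming convention and the three searches can also be sketched in Python. This is only an illustration: notebook_name and search are hypothetical helpers, and the patterns are the same shell-style globs used with ls.

```python
from datetime import date
from fnmatch import fnmatchcase


def notebook_name(day, initials, description):
    """Build '[ISO 8601 date]-[initials]-[description].ipynb'."""
    slug = "-".join(description.lower().split())
    return f"{day.isoformat()}-{initials.lower()}-{slug}.ipynb"


names = [
    notebook_name(date(2015, 6, 28), "jw", "initial data clean"),
    notebook_name(date(2015, 7, 2), "jw", "coal productivity factors"),
]


def search(pattern):
    """Return the notebook names that match a shell-style glob pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


by_date = search("2015-06*.ipynb")
by_author = search("2015*-jw-*.ipynb")
by_topic = search("*-coal-*.ipynb")
```

Because the date comes first in ISO 8601 order, a plain alphabetical listing of the directory is also chronological.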
 
Data loading
Avoid loading the same data in several different places. Use the head method to preview only the first few rows:
import pandas as pd

dframe = pd.read_csv('./data/titanic-data.csv')
dframe.head()
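One way to keep loading in a single place is a small loader function that every cell calls. The sketch below assumes pandas is installed; load_data, CSV_TEXT and the column names are made up for illustration, with an in-memory string standing in for the real CSV file so that it runs anywhere.

```python
import io

import pandas as pd

# Hypothetical stand-in for ./data/titanic-data.csv; the columns are invented.
CSV_TEXT = "name,age,survived\nAnn,29,1\nBob,41,0\nCid,35,1\nDee,52,0\n"


def load_data(nrows=None):
    """Single entry point for reading the dataset; nrows limits the rows parsed."""
    return pd.read_csv(io.StringIO(CSV_TEXT), nrows=nrows)


sample = load_data(nrows=2)  # parse only the first two rows
full = load_data()           # load everything once, then reuse `full`
preview = full.head(3)       # head() selects rows from data already in memory
```

Note the difference: nrows limits what read_csv parses, while head only selects rows from a frame that has already been loaded in full.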
Qgrid
qgrid - SlickGrid in Jupyter Notebooks

Note: the qgrid package has strict version dependencies on ipywidgets and Jupyter Notebook, so blindly installing the latest versions may leave qgrid unusable.
ast_node_interactivity
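When several DataFrame methods such as head and tail are called in the same code cell, Jupyter Notebook by default displays only the result of the last one. To display all of them, change the following setting: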
# see value of statements at once
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
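With this setting, the value of every expression statement in a cell is displayed, not just the last one.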
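To make the difference between the default and "all" concrete, here is a toy model written with the standard ast module. It is a simplified illustration under my own assumptions, not IPython's actual implementation; displayed_values is a hypothetical name, and any mode other than "all" is treated like the default "last_expr".

```python
import ast


def displayed_values(source, mode="last_expr"):
    """Run a cell and return the values a notebook would echo (toy model)."""
    tree = ast.parse(source)
    namespace = {}
    shown = []
    last_index = len(tree.body) - 1
    for index, node in enumerate(tree.body):
        if isinstance(node, ast.Expr):
            # A bare expression statement: evaluate it to get its value.
            code = compile(ast.Expression(body=node.value), "<cell>", "eval")
            value = eval(code, namespace)
            if value is not None and (mode == "all" or index == last_index):
                shown.append(value)
        else:
            # Assignments, imports, defs and so on: execute, nothing to echo.
            code = compile(ast.Module(body=[node], type_ignores=[]), "<cell>", "exec")
            exec(code, namespace)
    return shown


cell = "numbers = [10, 20, 30]\nnumbers[0]\nnumbers[-1]\n"
default_output = displayed_values(cell)     # only the last expression
all_output = displayed_values(cell, "all")  # every expression statement
```

Here default_output holds only the value of numbers[-1], while all_output holds the values of both numbers[0] and numbers[-1].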

Extensions
Jupyter-contrib extensions
Jupyter-contrib extensions is a family of extensions that adds a lot of functionality to Jupyter, such as a spell checker and a code formatter.
pip install https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tarball/master
pip install jupyter_nbextensions_configurator
jupyter contrib nbextension install --user
jupyter nbextensions_configurator enable --user

Create a presentation
pip install RISE
jupyter-nbextension install rise --py --sys-prefix
jupyter-nbextension enable rise --py --sys-prefix

Tips
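- Use _ plus a number to access the output of an earlier cell, for example _3 for the output of cell 3.
- Use _i plus a number to access the input of an earlier cell, for example _i3 for the input of cell 3.
- Put imports at the top of the notebook.
- Use ? plus a function name to show the documentation from its docstring.
- Magic commands: % marks a line command, %% marks a cell command.
 - Use %env to manage environment variables without restarting the Jupyter server.
 - Use %load to load code from an external source into the current code cell; the source can be a file path or a URL.
 - Use %store to pass data between two notebooks.
 - Use %who type to list the global variables in the current notebook; type can be a data type such as str.
 - Use %timeit or %%time to measure how long code takes to run.
- Suppress the output of the final expression in a cell by ending the line with a semicolon.
- Run shell commands with a ! prefix, as in !ls, or use %%bash to run the whole cell with bash.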
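For readers who want to see what a few of these magics roughly correspond to outside a notebook, here is a sketch using only the standard library. These are approximate plain-Python counterparts, not what IPython runs internally, and DEMO_MODE is a made-up variable name.

```python
import os
import subprocess
import sys
import timeit

# Roughly like "%env DEMO_MODE=1": set a variable for this process and its children.
os.environ["DEMO_MODE"] = "1"

# Roughly like "%timeit sum(range(100))", except that the loop count is fixed
# here, whereas %timeit chooses it automatically.
seconds = timeit.timeit("sum(range(100))", number=1000)

# Roughly like "!command": run a child process and capture what it prints.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['DEMO_MODE'])"],
    capture_output=True,
    text=True,
    check=True,
)
child_output = result.stdout.strip()
```

The child process sees DEMO_MODE because environment variables set through os.environ are inherited by processes started afterwards.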

References
- https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
- https://svds.com/jupyter-notebook-best-practices-for-data-science/
- http://blog.juliusschulz.de/blog/ultimate-ipython-notebook
- https://www.dataquest.io/blog/how-to-setup-a-data-science-blog/
 
This blog is under a CC BY-NC-SA 3.0 Unported License
Link to this article: https://dragonkid.github.io/2017/10/12/Jypyter Notebook Best Practice/